Generational Garbage Collection 
■ Young objects die quickly 

■ Nursery 

• Traced for live objects 

• Copy to mature space 

• Reclaimed ‘en masse’ 
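
The nursery steps above can be sketched as a toy model. This is a minimal illustration, not the Jikes RVM collector: the `Obj` class and `collect_nursery` function are hypothetical names, "copying" is modelled as appending to a list, and `roots` is assumed to hold every reference into the nursery (stack roots plus remembered-set entries).

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    name: str
    refs: list = field(default_factory=list)  # outgoing references


def collect_nursery(roots, nursery, mature):
    """Trace live nursery objects, promote them, then reclaim the nursery.

    Returns the number of dead objects reclaimed.
    """
    nursery_ids = {id(o) for o in nursery}
    live = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        # Only nursery objects are traced; mature objects are left alone.
        if id(obj) not in nursery_ids or id(obj) in live:
            continue
        live.add(id(obj))
        mature.append(obj)       # "copy to mature space"
        stack.extend(obj.refs)   # keep tracing through its references
    reclaimed = len(nursery) - len(live)
    nursery.clear()              # everything left is dead: reclaim en masse
    return reclaimed
```

The dead objects are never visited individually, which is why reclaiming the nursery is cheap when most young objects die quickly.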
Concurrent Garbage Collector 
Understanding the BIG core’s performance advantage 
Managed Multi-threaded Applications 
■ Extensions to work with JVM 

• Works with JIT compiler 

• Emulate System Calls (futex & nanosleep) 

• JVM-simulator communication with new instruction 

■ Simulates 

• x86, cycle-level, parallel, high-speed 

• Multicore, heterogeneous 


• Different frequencies 

• McPAT for power 



sniper 



Methodology 


■ Sniper simulator 

■ Jikes RVM 3.1.2 and DaCapo benchmarks 

• Collector 

• Generational Immix garbage collector 

• Concurrent mark-sweep snapshot algorithm 


• 2x minimum heap 

• Replay compilation, 2nd invocation 
Cooperative Cache Scrubbing 


■ Communicate managed language’s 
semantic information to hardware 

■ Caches 

• ‘Scrub’ dead lines 

• Zero lines without fetch 

> Result 



> Better cache management 

> Avoid traffic to DRAM 

> Save DRAM energy 
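
To see why scrubbing and zeroing avoid DRAM traffic, here is an idealized sketch of a write-back cache. It assumes each nursery line is written once, dies, and is later evicted. The function name `dram_traffic` and the 64-byte line size are assumptions for illustration, and the counts are not the measured results.

```python
LINE = 64  # assumed cache line size in bytes


def dram_traffic(num_lines, scrub, zero_without_fetch):
    """Count DRAM line transfers (reads, writes) for one nursery cycle."""
    reads = writes = 0
    for _ in range(num_lines):
        # Allocation: a write miss normally fetches the line from DRAM first.
        if not zero_without_fetch:
            reads += 1
        dirty = True   # the freshly written line is dirty
        # After collection the line holds only dead data.
        if scrub:
            dirty = False  # scrubbing drops the pending write-back
        # Eviction: only dirty lines are written back to DRAM.
        if dirty:
            writes += 1
    return reads, writes


lines_in_region = 32 * 1024 // LINE
baseline = dram_traffic(lines_in_region, scrub=False, zero_without_fetch=False)
cooperative = dram_traffic(lines_in_region, scrub=True, zero_without_fetch=True)
```

In this toy model the baseline pays one fetch and one write-back per line, while the cooperative case pays neither. Real savings are smaller because not every line follows this pattern.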



SW-HW Cooperative Scrubbing 


■ Software 

• Identify cache line-aligned dead/zero region 

• Generational Immix collector (stop-the-world) 

♦ After nursery collection, call scrub instruction on each 
line in entire range 

♦ Call zero instructions to zero region (32KB) 
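
The software side can be sketched as a loop over the cache lines that lie fully inside a dead region. This is a sketch under stated assumptions: the 64-byte line size is not given on the slide, and `aligned_lines` and `for_each_line` are hypothetical helpers, with `op` standing in for the scrub or zero instruction.

```python
LINE_SIZE = 64  # assumed; the slide does not state the line size


def aligned_lines(start, end, line_size=LINE_SIZE):
    """Return base addresses of cache lines fully contained in [start, end)."""
    first = (start + line_size - 1) // line_size * line_size  # round up
    last = end // line_size * line_size                       # round down
    return list(range(first, last, line_size))


def for_each_line(start, end, op):
    """Apply op (a stand-in for a scrub or zero instruction) to each line."""
    lines = aligned_lines(start, end)
    for addr in lines:
        op(addr)
    return len(lines)
```

Partial lines at either end are skipped, because scrubbing them could discard live data that shares the line.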

■ Hardware 


• Scrubbing (LLC) 

♦ clinvalidate: invalidates cache line 

♦ clundirty: clears dirty bit 


PowerPC’s dcbi, ARM 



♦ clclean: clears dirty bit, moves line to LRU 

• Zeroing (L2) 

♦ clzero: zero cache line without fetch 


PowerPC’s dcbz 
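
The effect of each instruction on a cache line can be illustrated with a toy model of a single cache set. The `ToyCacheSet` class is hypothetical and only mirrors the one-line descriptions above. It does not model the real hardware, associativity, or coherence.

```python
from collections import OrderedDict


class ToyCacheSet:
    """One cache set; dict order runs from LRU (first) to MRU (last)."""

    def __init__(self):
        self.lines = OrderedDict()  # tag -> {"data": bytes, "dirty": bool}

    def write(self, tag, data):
        self.lines[tag] = {"data": data, "dirty": True}
        self.lines.move_to_end(tag)  # written line becomes most recently used

    def clinvalidate(self, tag):
        self.lines.pop(tag, None)    # drop the line; nothing is written back

    def clundirty(self, tag):
        if tag in self.lines:
            self.lines[tag]["dirty"] = False  # no write-back on eviction

    def clclean(self, tag):
        if tag in self.lines:
            self.lines[tag]["dirty"] = False
            self.lines.move_to_end(tag, last=False)  # next eviction victim

    def clzero(self, tag, line_size=64):
        # Install an all-zero line directly, without fetching old contents.
        self.write(tag, bytes(line_size))
```

Of the three scrubbing variants, only `clclean` also changes replacement order, so the dead line is the first to be evicted.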


• Modifications to MESI cache coherence protocol 

♦ Back-propagation from LLC to L1/L2 cache levels 

♦ Local coherence transitions (no off-chip) 
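
Back-propagation can be pictured as the LLC applying the scrub and then pushing the same state change up to the private caches. The sketch below is one plausible reading, not the protocol from the talk: the Modified-to-Exclusive transition is an assumption, and `scrub_clean` is a hypothetical name.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def scrub_clean(levels, addr):
    """Apply a clean-style scrub to addr, starting at the LLC.

    levels: list of dicts (addr -> State), ordered [L1, L2, LLC].
    Returns the number of off-chip messages, which stays at zero.
    """
    off_chip = 0
    for cache in reversed(levels):     # LLC first, then back-propagate up
        if cache.get(addr) is State.M:
            cache[addr] = State.E      # assumed transition: dirty to clean
    return off_chip
```

Every transition here is local to the chip, which matches the "no off-chip" point above.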
Methodology 


■ Sniper simulator 

• 4 cores, 8MB shared L3 (LLC), McPAT 

■ Jikes RVM 3.1.2 and DaCapo benchmarks 

• Generational Immix garbage collector 

• 4 application, 4 GC threads 
clclean+clzero Improvements 
Related Work 


■ Cooperative cache management 

• ESKIMO by Isen & John, MICRO 09 

♦ Useless reads and writes to DRAM by sequential C 
programs 

♦ Reduce energy 

♦ Require large map in hardware, extra cache bits 

• Wang et al., PACT 02/ ISCA 03; Sartor et al., 05 

♦ C & Fortran static analysis to give cache hints to evict or 
keep data 

■ Zero initialization [Yang et al., OOPSLA 11] 

• Studied costs in time, cache and traffic 

• Use non-temporal writes to DRAM, increase bandwidth 


UNIVERSITEIT GENT 


Conclusions 


Software-hardware cooperative cache 
scrubbing 

• Leverages region allocation semantics 

• Changes to MESI coherence protocol 

• New multicore architectural simulation methodology 

• Reductions 

> 59% traffic 

> 14% DRAM energy 

> 4.6% execution time 